Abstract

As anticipated in True Names by Vernor Vinge, identity has been recognized as our most valued possession in cyberspace. Attribution is a key concept in enabling trusted identities and deterring malicious activities. As more people use the Web to communicate, work, and otherwise have fun, is it possible to uniquely identify someone based on their Web browsing behavior, or to differentiate between two persons based solely on their Web browsing histories? Based on a user study, this paper provides some insights into these questions. We describe characteristic features of Web browsing behavior and present our algorithm and analysis of an ensemble learning approach that leverages those features for user authentication.

1 Introduction

The problem of user identity is one of the fundamental and still largely unresolved problems of cyberspace, testing the boundary between trust and privacy. Multiple approaches have been proposed to solve this problem through consolidated password schemes (e.g., OpenID (Thibeau and Reed 2009), Mozilla’s Persona (Mills 2011)). On the other hand, the popularity of social media such as Facebook and Twitter has made large amounts of spontaneous online usage data available for analysis, and individual search history patterns are already used by Google to personalize search results. Reality mining (Pentland and Pentland 2008) captures unconscious patterns of behavior through signals obtained from wearable mobile computing devices, revealing personal characteristics in order to shape human interaction. As our interaction with the Web becomes more natural and even mediates our interaction with others (Turkle 2012), we claim that Web browsing behavior can be rich enough to uniquely characterize who we are through unconscious behavioral patterns, and to authenticate us with a cognitive personal fingerprint.

Attribution is broadly defined as the assignment of an effect to a cause. We differentiate between authentication and identification as two techniques for the attribution of identity. Authentication is defined as the verification of a claimed identity (Jain, Bolle, and Pankanti 1999). Identification is a one-to-many matching problem (which of the known users is this?), while authentication is a one-to-one matching problem (is this user who they claim to be?). While biometric methodologies strive to provide instant authentication results, this paper focuses on the continuous authentication problem, where authentication is performed over time by monitoring the user’s activities.

The paper is organized as follows. In Section 2, we briefly describe prior research on the modeling of Web browsing behavior and attribution in cyberspace. In Section 3, we present our descriptive analysis of the different features of Web browsing behavior from clickstream data obtained through a user study. In Section 4, we present our empirical analysis of authenticating users with classifiers trained on individual features and introduce our algorithm for an ensemble of classifiers trained on subsets of those features. Our conclusions and suggestions for future work are in Section 5.

2 Related Work

Marketers have long been interested in understanding Web interaction behavior (Atterer, Wnuk, and Schmidt 2006) in order to design websites that entice visitors to finish their Web session with a checkout of their shopping cart. Behavioral targeting is an approach used by advertisers (e.g., DoubleClick) that tracks Web behavior to deliver advertisements matching an individual’s semantic profile, defined by content-related preferences and interests. Research in this area has concentrated on inferring demographic characteristics such as age, gender, and income from behavior rather than on authenticating a single individual (Goel, Hofman, and Sirer 2012). There has also been some research on understanding online browsing behavior from an aggregate perspective in order to identify influential websites in user navigation patterns (Kumar and Tomkins 2010).

In contrast to semantic patterns, syntactic patterns characterize Web browsing based strictly on session and navigation features. They include the burstiness of pageviews, the number of page revisits, and the number of pages between revisits (Kumar and Tomkins 2010). As an illustration of burstiness, Kumar and Tomkins (2010) noted that, across users, the cumulative distribution of the inter-arrival time between any visit to a particular URL and the previous visit fits a logarithmic function. On the server side, visitors are independent of each other, so the distribution of visits can follow a Poisson distribution. It is not clear whether this type of distribution also fits the time distribution on the user side, since the webpages visited by a user are not independent of each other. We will show another illustration of burstiness on the user side. In addition, the length of a session (in both time and number of pages visited) and the starting time and day of the week also characterize user syntactic patterns.

The attribution problem in cyberspace has been addressed in several ways, mainly by leveraging features of the browser (e.g., history stealing, cookies) or by accessing datasets containing partially identifying information. For example, de-anonymization in social networking websites has been accomplished by intersecting the member lists of the social network groups a user belongs to, using information from hyperlinks in the browser history and knowledge about those groups (Wondracek et al. 2010). In general, unique identification is possible by cross-referencing independent information sets containing partial information with a universal set, in a manner equivalent to a database join (also known as “linkage attacks”). For example, it has been possible to link medical records to individuals in voter registration records (Sweeney 1996). Some success has been reported with the per-user classification of global syntactic features of a Web session (e.g., length of session, average time on a page) aggregated over several sessions (Padmanabhan and Yang 2006). It has also been shown that authorship of content can be determined from stylometric features on an Internet scale, threatening anonymity (Narayanan et al. 2012), but this type of attribution depends on published content. Research in predicting user behavior in cyberspace has also focused on improving tasks such as information retrieval (Armstrong et al. 1995). For example, based on the content of the current webpage and a user’s original search keywords, the most relevant hyperlinks in the page are highlighted to guide the selection of the next page to visit. This type of prediction is oriented toward the information presented in context to the user rather than the specific activity that a user might pursue (e.g., sending an email, reading a paper). In contrast to previous approaches, we address the attribution problem by leveraging both the syntactic patterns in Web browsing history and the semantic content of this history, represented by the genre of each page.

The authentication problem has been addressed in the context of masquerade detection in computer security by modeling user command-line sequences. In the masquerade detection problem, the task is to positively identify masqueraders, not to positively identify a particular user. Recent experiments modeling user-issued OS commands as bags of words without timing information obtained a 72.7% true positive rate and a 6.3% false positive rate (Salem and Stolfo 2010) on a set of 15000 commands for 70 users, grouped in blocks of 100 commands. In that work, a one-class support vector machine (SVM) (Scholkopf et al. 2000) was shown to perform better than threshold-based comparison with a distance metric. We extend the results of this work to individual feature sets of Web browsing behavior, both in isolation and in combination within an ensemble.

3 Web Browsing Modeling

Logging of spontaneous clickstream data in our user study consisted of recording, through custom-built browser extensions (Firefox and Chrome), the timestamp and the URL that was visible to the user at the time (i.e., a pageview). The data was parsed offline to minimize interference with the user. Ten subjects (2 females and 8 males) participated in this study during the course of their work for one month. For clarity, we only show the results of the same 3 users in our figures. The population was fairly homogeneous, and its members rated themselves as highly “Web savvy.” The following features, which are detailed later, were extracted from the data: day-of-week, time-of-day, pauses (below 5 mins), burstiness (below 10 mins), time between revisits, and genres (i.e., page types). The number of pageviews per user varied from 1200 to 12000. Web browsing behavioral data is noisy and requires some pre-processing for analysis. Noise occurs due to distortion from network behavior, errors in accessing URLs, and automatic page insertion by the browser. Future work will mitigate those problems.

The clickstream data is parsed into “sessions,” where a session is defined as a continuous stream of pageviews delimited by pauses greater than 30 minutes, as in (Kumar and Tomkins 2010). The number of sessions per user varied from 42 to 205, and the average session length ranged from 14 to 131 pageviews across users. User sessions are the data points in our study of Web behavior. We distinguish between global session features and internal session features, as explained below.
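The 30-minute sessionization rule can be sketched in a few lines of Python. This is a minimal illustration, not the study’s actual parser: it assumes the clickstream is already a time-sorted list of `(timestamp, url)` pairs, and the function name `split_sessions` is hypothetical.

```python
from datetime import datetime, timedelta

# Session boundary used in the text: a pause strictly greater than 30 minutes.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(pageviews, gap=SESSION_GAP):
    """Split a time-sorted list of (timestamp, url) pageviews into sessions."""
    sessions = []
    current = []
    for ts, url in pageviews:
        # Start a new session when the pause since the last pageview exceeds the gap.
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions


# Toy clickstream: the 48-minute pause before "c" opens a second session.
t0 = datetime(2012, 5, 7, 9, 0)
clicks = [(t0, "a"), (t0 + timedelta(minutes=2), "b"), (t0 + timedelta(minutes=50), "c")]
print([len(s) for s in split_sessions(clicks)])  # [2, 1]
```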

3.1 Global Session Features

Standard global session features capture characteristics of a session across its pageviews. They include the day-of-week (DOW) and time-of-day (TOD) distributions. With the advent of teleworking and flextime, these features are not uniform across workers. Figure 1 illustrates the weekly online activity patterns of three users, aggregated over all sessions. User 3 is the only one not active during the weekend. Figure 2 shows the hourly online activity patterns of the same three users, aggregated across all sessions. User 2 is mostly active in the morning, while User 1 is active after dinner.
Figure 1: Daily activity patterns for three users aggregated across all sessions

Other global session features in our empirical study include the total number of pageviews, the average duration of a pageview, and the number of unique pageviews.
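A per-session global feature vector could be assembled as below. The layout is an assumption for illustration: the DOW and TOD distributions are taken over the session’s pageviews (7 and 24 bins), and a pageview’s duration is approximated by the gap to the next pageview, so the last pageview of a session is not timed. `global_features` is a hypothetical name.

```python
from datetime import datetime, timedelta


def global_features(session):
    """Global feature vector for one session of (timestamp, url) pageviews.

    Layout (assumed): 7 DOW shares, 24 TOD shares, then
    [number of pageviews, number of unique pageviews, average pageview duration in s].
    """
    n = len(session)
    dow = [0] * 7
    tod = [0] * 24
    for ts, _ in session:
        dow[ts.weekday()] += 1  # Monday = 0
        tod[ts.hour] += 1
    dow = [c / n for c in dow]
    tod = [c / n for c in tod]
    # Duration of a pageview approximated by the gap to the next pageview.
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(session, session[1:])]
    avg_duration = sum(gaps) / len(gaps) if gaps else 0.0
    unique = len({url for _, url in session})
    return dow + tod + [float(n), float(unique), avg_duration]


t0 = datetime(2012, 5, 7, 9, 0)  # a Monday, 9 a.m.
session = [(t0, "a"), (t0 + timedelta(seconds=30), "b"), (t0 + timedelta(seconds=90), "a")]
vec = global_features(session)
print(len(vec), vec[-3:])  # 34 [3.0, 2.0, 45.0]
```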

3.2 Internal Session Features

An internal session feature captures characteristics of pageviews within a session.


Figure 2: Hourly activity patterns for three users aggregated across all sessions (x-axis: hour of day)
Figure 4: Burstiness profile below 1 min aggregated across all sessions for three users


Pauses A pause is the time spent by the user on a webpage. It is computed as the difference between the timestamps of two consecutive pageviews. Like the timing of other human activities, pause profiles have been reported to follow a power-law distribution (Barabasi 2005). We fit this data with an exponential function; Figure 3 shows the exponential fit of pause profiles below 5 minutes for three users. The fitted function can be used to obtain the probability of the next pageview and can act as a signature by which to compare pause distributions. Differences between users are more pronounced for shorter pauses.
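The pause computation and an exponential fit of the pause profile can be sketched as follows. The paper does not state how the fit was computed; this sketch assumes a log-linear least-squares fit of y = a·exp(-b·x) to the normalized histogram of pauses below 5 minutes, and the helper names are hypothetical.

```python
import math


def pause_times(timestamps):
    """Pauses in seconds between consecutive pageviews (sorted timestamps in seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def pause_profile(pauses, limit=300.0, bins=100):
    """Normalized histogram of the pauses below `limit` seconds (5 min by default)."""
    width = limit / bins
    kept = [p for p in pauses if 0 <= p < limit]
    counts = [0] * bins
    for p in kept:
        counts[min(int(p // width), bins - 1)] += 1
    total = len(kept) or 1  # avoid dividing by zero for an empty profile
    return [c / total for c in counts], width


def fit_exponential(profile, width):
    """Fit y = a * exp(-b * x) by least squares on ln(y) over the non-empty bins.

    x is the bin center in seconds; returns (a, b).
    """
    pts = [((i + 0.5) * width, math.log(y)) for i, y in enumerate(profile) if y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty bins to fit")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


profile, width = pause_profile(pause_times([0, 2, 3, 7, 8, 9, 15]))
a, b = fit_exponential(profile, width)
```

Bins with a zero count are skipped because their logarithm is undefined; the fitted pair (a, b) is one compact signature that could be compared across users.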


Figure 3: Pause profiles below 5 mins for three users aggregated across all sessions, truncated to the first 5 seconds

Burstiness Burstiness, as a characteristic of human behavior, has also been reported to follow a power-law distribution. Barabasi (2005) explains burstiness as a consequence of the decision process by which we prioritize tasks. It is computed as the change in pause time between consecutive pageviews, i.e., a second-order pause time (Kwok 2012). While burstiness patterns are fairly uniform across users for longer pause changes, they can differ considerably for shorter pause changes, as illustrated in Figure 4.
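Burstiness as a second-order pause can be computed from the pause sequence and then binned like the pause profile. Taking the absolute change is an assumption (the text only says “change in pause time”), as are the function names; the 10-minute cutoff and 100 bins follow the settings given in Section 4.1.

```python
def second_order_pauses(pauses):
    """Change in pause time between consecutive pageviews (absolute value assumed)."""
    return [abs(b - a) for a, b in zip(pauses, pauses[1:])]


def burstiness_profile(pauses, limit=600.0, bins=100):
    """Normalized histogram of pause changes below `limit` seconds (10 min by default)."""
    changes = [d for d in second_order_pauses(pauses) if d < limit]
    width = limit / bins
    counts = [0] * bins
    for d in changes:
        counts[min(int(d // width), bins - 1)] += 1
    total = len(changes) or 1  # avoid dividing by zero for an empty profile
    return [c / total for c in counts]


# Pauses in seconds; the 894 s change is above the 10-minute cutoff and is dropped.
pauses = [2.0, 3.0, 6.0, 900.0]
print(second_order_pauses(pauses))  # [1.0, 3.0, 894.0]
profile = burstiness_profile(pauses)
```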

Time between revisits How often is a webpage revisited? Some webpages have been found to play a role similar to stop words in a sentence (Montgomery and Faloutsos 2001). The rate at which webpages are revisited may also serve as an indicator of user identity. The average revisit rate ranged from 28% to 46% among our users. Figure 5 illustrates the profile of the time between revisits (under 6 mins) for three users. The differences between users are large, mainly in the shorter time intervals.
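One way to compute the revisit rate and the times between revisits is sketched below. The definitions are assumptions: a revisit is a pageview of a URL already seen earlier, its gap is measured from the most recent previous view of the same URL, and the revisit rate is the share of pageviews that are revisits. `revisit_stats` is a hypothetical name.

```python
def revisit_stats(pageviews, limit=None):
    """Return (revisit_rate, gaps) for time-sorted (timestamp_seconds, url) pageviews.

    If `limit` is given, only gaps shorter than `limit` seconds are returned
    (e.g. 360 for the 6-minute profile); the rate always uses all revisits.
    """
    last_seen = {}
    gaps = []
    for ts, url in pageviews:
        if url in last_seen:
            gaps.append(ts - last_seen[url])  # time since the previous view of this URL
        last_seen[url] = ts
    rate = len(gaps) / len(pageviews) if pageviews else 0.0
    if limit is not None:
        gaps = [g for g in gaps if g < limit]
    return rate, gaps


views = [(0, "a"), (10, "b"), (25, "a"), (40, "c"), (70, "a"), (500, "b")]
print(revisit_stats(views, limit=360))  # (0.5, [25, 45])
```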


Figure 5: Time between revisits (under 6 min.) profile for three users across all sessions, truncated to the first 72s

Genres Encoding is necessary to obtain reusable patterns of behavior. We encode the semantic and stylistic content of webpages into genres. Genres are functional categories of information presentation; in other words, genres are a mixture of style, form, and content. For example, books have many genres, such as mystery, science fiction, fiction, and biography. Similarly, webpages have evolved their own genres (e.g., blog, homepage, article). The genre of a document is tied to its purpose and reflects social conventions for disseminating and searching for information. We claim that genres are more indicative than topics for distinguishing Web browsing behavior. For example, some people visit discussion forums (e.g., reddit) more frequently than blogs (e.g., WordPress), regardless of content. However, genres and topics do combine in important ways (e.g., spam is a combination of content and style).

We used the Diffbot page classifier¹ to classify pages into genres. Diffbot is a web service that currently categorizes webpages into 21 page types. There are several problems in using a third-party web service, especially one that is still in beta. Although we expect the quality of the categorization to improve as Diffbot matures, the main problems are certificate errors (some of which could be resolved internally by loading the certificates or via automatic trust configuration), external errors (which include errors that a user could have experienced), errors due to the Web service itself (10% of all accesses), the limit on the number of requests per month, and the lack of control over the page types. Figure 6 illustrates the genre profiles for three users. There are large differences between users in the genres of the pages visited. No strong linear correlation was found between genres and pauses, so we cannot infer the time spent on a webpage from its genre.

¹ http://www.diffbot.com


Figure 6: Genre profiles for three users (excluding errors) aggregated across all sessions
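A per-session genre feature can be built as a normalized distribution over page types. The label list `GENRES` is hypothetical (Diffbot’s 21 page types are not listed in this paper), and pages whose genre could not be obtained, e.g. labelled "error", are dropped as in Figure 6.

```python
from collections import Counter

# Hypothetical subset of page-type labels, for illustration only.
GENRES = ["article", "blog", "discussion", "homepage"]


def genre_profile(page_genres, genres=GENRES):
    """Normalized genre distribution of one session.

    Labels outside `genres` (such as an "error" label) are ignored.
    """
    counts = Counter(g for g in page_genres if g in genres)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(genres)
    return [counts[g] / total for g in genres]


print(genre_profile(["blog", "article", "error", "blog"]))
```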

4 Empirical Study

The goal of this study is to verify the claim that users can be authenticated from their Web browsing behavior. All experiments were conducted in the Weka machine learning workbench (Hall et al. 2009), augmented by our own ensemble algorithms. We extracted the features of Web browsing behavior described above from each user session and aggregated them into one feature vector per session. A user’s dataset consisted of all sessions collected for that user. For each user, we compared the false rejection rate (FRR) (i.e., false negative rate) and the false acceptance rate (FAR) (i.e., false positive rate) of classifiers derived from each feature set and of an ensemble classifier composed of classifiers based on weighted random samples of those features. FRR results were obtained using cross-validation on the user’s dataset, while FAR results were obtained by applying the classifier trained on the user’s dataset to a dataset containing the data of all the other users. Note that FRR results are expected to be better in practice.
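The two metrics reduce to simple counts once a verifier has been trained for a claimed user. The sketch below assumes a hypothetical verifier `accept(session) -> bool`; FRR is computed on the user’s held-out sessions and FAR on the sessions of all other users, as described above.

```python
def authentication_rates(accept, own_sessions, other_sessions):
    """Return (FRR, FAR) in percent for a verifier `accept(session) -> bool`.

    FRR: share of the claimed user's own (held-out) sessions that are rejected.
    FAR: share of the other users' sessions that are accepted.
    """
    rejected = sum(1 for s in own_sessions if not accept(s))
    accepted = sum(1 for s in other_sessions if accept(s))
    frr = 100.0 * rejected / len(own_sessions)
    far = 100.0 * accepted / len(other_sessions)
    return frr, far


def toy_accept(session_length):
    """Hypothetical verifier: accepts sessions with fewer than 50 pageviews."""
    return session_length < 50


print(authentication_rates(toy_accept, [10, 20, 30, 80], [40, 60, 70, 90, 100]))
# (25.0, 20.0)
```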

4.1 One-Class Classification

One-class classification is pertinent when only positive examples are available for training, because negative examples are hard to come by or do not fit into a unique category. Applications of one-class classification include anomaly detection, fraud detection, outlier detection, authorship verification, and document classification where categories are learned individually. The goal of one-class classification is to detect all classes that differ from the target class without knowing them in advance. One-class classification is similar to unsupervised learning but tries to solve a discriminative problem (i.e., self or not self) rather than a generative problem, as in clustering algorithms or density estimation. Several algorithms have been modified to perform one-class classification. We used the one-class SVM (Scholkopf et al. 2000) available in LibSVM as part of the Weka machine learning workbench. SVMs are large-margin classifiers that map feature vectors to a higher-dimensional space using kernels based on similarity metrics. The optimization objective in SVMs is to find a linear separating hyperplane with maximum margin between class boundaries. With a Gaussian kernel, this hyperplane is linear in the mapped feature space but corresponds to a non-linear decision boundary in the original input space. The resulting decision function depends only on kernel similarities to the “support” vectors (i.e., training instances on or near the decision boundary). Formally, let $\mathbf{x}$ and $\mathbf{x}'$ be two feature vectors and $\phi$ a feature mapping function to a higher-dimensional space; a kernel function $k$ is defined as $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. Since the number of features and the number of examples (sessions) for each user are relatively small, we use the radial basis function kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2)$ (Hsu et al. 2003), based on a Gaussian transform of the feature space, with default parameters. The one-class SVM in the LibSVM library finds a hyperplane that separates the mapped training data from the origin with maximum margin, treating the origin as the only member of the complement class.
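For reference, the standard ν-formulation of the one-class SVM (Scholkopf et al.) makes the “separate the data from the origin” idea precise. Here $m$ is the number of training sessions, $\nu \in (0, 1]$ upper-bounds the fraction of training sessions treated as outliers, and $\alpha_i$ are the dual coefficients; the specific $\nu$ and $\gamma$ defaults used in the experiments are not restated here.

$$
\min_{\mathbf{w},\,\boldsymbol{\xi},\,\rho}\ \frac{1}{2}\lVert \mathbf{w} \rVert^2 + \frac{1}{\nu m}\sum_{i=1}^{m} \xi_i - \rho
\quad \text{s.t.} \quad \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle \ge \rho - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, m
$$

$$
f(\mathbf{x}) = \operatorname{sgn}\Big( \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) - \rho \Big)
$$

A session is accepted as “self” when $f(\mathbf{x}) = +1$ and rejected as “not self” when $f(\mathbf{x}) = -1$.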

Table 1 shows the results of one-class SVM classification for each user and each feature set. The global features consist of the DOW distribution, the TOD distribution, the number of pageviews, the number of unique pageviews, and the average duration of a pageview in the session. For each session, pauses (below 5 mins), bursts (below 10 mins), and times between revisits were discretized into 100 bins. All feature distributions (DOW, TOD, pauses, bursts, revisits, and genres) were normalized. In addition, each feature was scaled to the [-1, 1] range on the training dataset (i.e., the user’s dataset). FRR results were obtained with 10-fold cross-validation averaged over 10 runs, while FAR results were obtained by applying the classifier trained on the entire user dataset to the data of the other users, using the feature scaling obtained during training (Hsu et al. 2003). Note that FRR results should be better in practice.
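The preprocessing described above (fixed-bin histograms, normalization, and [-1, 1] scaling fitted on the user’s own data and reused on other users’ data) can be sketched as follows. Mapping a feature that is constant in training to 0.0 is an assumed choice, and all names are hypothetical.

```python
def histogram(values, limit, bins=100):
    """Discretize values in [0, limit) into `bins` equal-width bins, normalized to sum to 1."""
    width = limit / bins
    kept = [v for v in values if 0 <= v < limit]
    counts = [0] * bins
    for v in kept:
        counts[min(int(v // width), bins - 1)] += 1
    total = len(kept) or 1  # avoid dividing by zero for an empty histogram
    return [c / total for c in counts]


def fit_scaler(train_rows, lo=-1.0, hi=1.0):
    """Record each feature's min and max over the training rows (the user's sessions)."""
    bounds = [(min(col), max(col)) for col in zip(*train_rows)]
    return bounds, lo, hi


def apply_scaler(row, scaler):
    """Map each feature linearly so that training values fall in [lo, hi].

    Rows from other users reuse the training bounds and are not clipped,
    so they may fall outside [lo, hi].
    """
    bounds, lo, hi = scaler
    scaled = []
    for v, (mn, mx) in zip(row, bounds):
        if mx == mn:
            scaled.append(0.0)  # constant in training: carries no information (assumed choice)
        else:
            scaled.append(lo + (hi - lo) * (v - mn) / (mx - mn))
    return scaled


train = [[0.0, 5.0], [10.0, 5.0]]
scaler = fit_scaler(train)
print(apply_scaler([5.0, 5.0], scaler))   # [0.0, 0.0]
print(apply_scaler([20.0, 1.0], scaler))  # [3.0, 0.0]
```

Reusing the training bounds on other users’ sessions, as the second call shows, follows the advice in (Hsu et al. 2003) to scale test data with the parameters obtained on the training data.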

Figure 7 aggregates the results of Table 1. It illustrates the tug-of-war between FRR and FAR outcomes and the difficulty of obtaining good results on both authentication metrics: an increase in FRR is usually accompanied by a decrease in FAR, and vice versa. Genres and global features were found to be good at differentiating Web browsing behavior (as evidenced by lower FAR), while pauses, bursts, and revisits were found to have better recognition rates (as evidenced by lower FRR). However, none of the individual features is good enough in isolation to authenticate a user.


Figure 7: Average feature set results comparison

4.2 Ensemble Learning

Can we leverage these features collectively to improve performance? Accuracy and diversity of the individual classifiers were found to be necessary and sufficient conditions for a high-performing ensemble of classifiers (Dietterich 2000). Furthermore, it was shown that ensemble learning does not follow the Occam’s razor principle that increased complexity decreases generalization accuracy (Ho 1998). Ensemble learning varies the type of learner or the type of input (e.g., the set of instances or features) to achieve diversity. For example, bagging (Breiman 1996) varies the set of instances, while the random subspace method varies the set of features (Ho 1998). We use the random subspace method to vary the input features of one-class SVMs. The decisions of the different classifiers (i.e., self or not self) are combined in a weighted vote, where the weight of each classifier is evaluated by two-fold cross-validation on the training set. Choosing accurate classifiers is problematic here, since it is easy to overfit in the one-class classification problem: a trivial classifier that always answers “self” achieves perfect accuracy on training data that contains only the target user’s sessions. To overcome this problem and address the diversity issue, we select a subset of the classifiers by weighted sampling. We train a fixed number of classifiers (300), each on a random subset of features (5), as a pool of classifiers to choose from. A fixed number of classifiers (107) is then selected from this pool for our ensemble. These parameters (number of classifiers, number of features, and pool size) were selected empirically for good performance on User 1, without adjustment for the other users. Future work will select a variable number of features. Pause and time-between-revisit distributions were truncated to the first 20 bins to prevent spurious features due to sparsity in the data. Algorithm 1 describes our methodology. Table 2 compares the random subspace method with a mixture-of-experts ensemble, in which the decisions of the classifiers trained on the individual feature sets are combined into a weighted vote using a similar methodology.

| User | #sessions | Global FRR | Global FAR | Pauses FRR | Pauses FAR | Bursts FRR | Bursts FAR | Revisits FRR | Revisits FAR | Genres FRR | Genres FAR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| User 1 | 98 | 50±1.07 | 35±0.0 | 50±0.78 | 30±0.0 | 52±1.40 | 37±0.0 | 47±0.00 | 32±0.00 | 50±1.88 | 28±0.0 |
| User 2 | 72 | 52±1.63 | 43±0.0 | 51±0.94 | 52±0.0 | 50±0.97 | 48±0.0 | 48±0.94 | 59±0.00 | 48±4.43 | 34±0.0 |
| User 3 | 86 | 55±1.71 | 33±0.0 | 52±0.97 | 37±0.0 | 52±0.67 | 41±0.0 | 54±1.13 | 50±0.00 | 62±6.18 | 22±0.0 |
| User 4 | 121 | 56±1.37 | 37±0.0 | 51±1.25 | 42±0.0 | 51±1.16 | 37±0.0 | 50±0.92 | 56±0.0 | 49±2.79 | 1±0.0 |
| User 5 | 88 | 59±1.59 | 31±0.0 | 50±0.87 | 45±0.0 | 50±0.70 | 29±0.0 | 45±0.00 | 0±0.0 | 55±2.26 | 25±0.0 |
| User 6 | 181 | 53±1.07 | 33±0.0 | 50±0.53 | 51±0.0 | 50±0.53 | 52±0.0 | 51±0.00 | 60±0.0 | 51±0.97 | 23±0.0 |
| User 7 | 205 | 55±0.92 | 39±0.0 | 51±0.57 | 56±0.0 | 51±0.52 | 64±0.0 | 51±0.84 | 50±0.0 | 51±0.79 | 33±0.0 |
| User 8 | 42 | 64±2.78 | 44±0.0 | 58±1.35 | 41±0.0 | 53±1.26 | 45±0.0 | 53±1.26 | 48±0.0 | 55±3.27 | 20±0.0 |
| User 9 | 59 | 60±1.41 | 49±0.0 | 55±0.97 | 44±0.0 | 51±1.66 | 45±0.0 | 53±2.27 | 36±0.0 | 54±2.51 | 47±0.0 |
| User 10 | 44 | 62±2.40 | 27±0.0 | 53±2.13 | 43±0.0 | 57±0.0 | 48±0.0 | 51±1.03 | 60±0.0 | 52±3.02 | 20±0.0 |
| Avg | 99 | 56.5 | 37.1 | 52.1 | 44.1 | 51.7 | 44.6 | 50.3 | 45.1 | 52.7 | 25.5 |

Table 1: FRR and FAR results (given as percentages) obtained with a one-class SVM classifier for each feature set and each user. FRR results are averaged over 10 runs.


| User | Mixture of Experts FRR | Mixture of Experts FAR | Random Subspace FRR | Random Subspace FAR |
|---|---|---|---|---|
| User 1 | 50±0.66 | 30±0.0 | 18±2.20 | 7±6.70 |
| User 2 | 56±1.83 | 48±0.0 | 33±7.14 | 7±9.20 |
| User 3 | 54±2.27 | 32±0.0 | 41±4.05 | 10±8.84 |
| User 4 | 53±1.87 | 35±0.0 | 28±3.56 | 5±5.64 |
| User 5 | 49±1.35 | 16±0.0 | 22±3.26 | 19±8.84 |
| User 6 | 50±1.18 | 43±0.0 | 35±4.28 | 11±8.13 |
| User 7 | 53±0.53 | 50±0.0 | 32±4.11 | 26±10.47 |
| User 8 | 55±0.84 | 37±0.0 | 44±8.99 | 13±11.27 |
| User 9 | 50±1.08 | 39±0.0 | 27±6.11 | 15±10.05 |
| User 10 | 55±2.37 | 37±0.0 | 32±6.61 | 15±7.68 |
| Avg | 52.5 | 36.7 | 31.2 | 12.8 |

Table 2: FRR and FAR results obtained by ensemble methods of one-class SVM classifiers with a weighted vote scheme.


Algorithm 1 Random subspace ensemble learning training methodology, where instances are the training instances, P is the pool size, cl the classifier algorithm, f the number of features, and n the ensemble size (n < P).

BUILDCLASSIFIER(instances, cl, P, f, n)
  features ← instances.features
  FOR i = 1 to P
    FOREACH feature in features
      feature.weight ← random()
    END
    fSubset ← weight_sampling(features, f)
    // Transform instances
    fInsts ← filter(instances, fSubset)
    // Two-fold cross-validation
    eval ← cross-validate(cl, 2, fInsts)
    model ← train-classifier(cl, fInsts)
    model.weight ← eval
    pool[i] ← model
  END
  models ← weight_sampling(pool, n)
  RETURN models
END
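Algorithm 1, together with the weighted vote used at decision time, can be turned into a runnable sketch. Several pieces are assumptions rather than the study’s Weka code: `CentroidVerifier` is a hypothetical stand-in for the one-class SVM (any learner with `fit` and `accepts` methods would do), a model’s weight is the mean fraction of held-out “self” sessions it accepts in two-fold cross-validation, both weighted samplings draw without replacement, and the vote accepts a session when the summed weights of “self” votes exceed those of “not self” votes.

```python
import random


class CentroidVerifier:
    """Hypothetical stand-in for a one-class SVM: accepts a vector whose distance to
    the training centroid is at most the largest distance seen during training."""

    def fit(self, rows):
        self.centroid = [sum(col) / len(col) for col in zip(*rows)]
        self.radius = max(self._dist(r) for r in rows)
        return self

    def _dist(self, row):
        return sum((a - b) ** 2 for a, b in zip(row, self.centroid)) ** 0.5

    def accepts(self, row):
        return self._dist(row) <= self.radius


def weighted_sample(items, weights, k, rng):
    """Draw k distinct items; each draw is proportional to the remaining weights."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(items)), weights=weights)[0]
        chosen.append(items.pop(i))
        weights.pop(i)
    return chosen


def acceptance_rate(make_learner, train, test):
    """Fraction of held-out 'self' rows accepted by a learner trained on `train`."""
    model = make_learner().fit(train)
    return sum(model.accepts(r) for r in test) / len(test)


def build_classifier(instances, make_learner, pool_size, f, n, seed=0):
    """Random subspace ensemble following the steps of Algorithm 1."""
    rng = random.Random(seed)
    n_features = len(instances[0])
    pool = []
    for _ in range(pool_size):
        # Random feature weights, then a weighted sample of f distinct features.
        feature_weights = [rng.random() for _ in range(n_features)]
        subset = weighted_sample(range(n_features), feature_weights, f, rng)
        rows = [[row[j] for j in subset] for row in instances]
        # Two-fold cross-validation: mean held-out acceptance becomes the model weight.
        half = len(rows) // 2
        a, b = rows[:half], rows[half:]
        weight = (acceptance_rate(make_learner, a, b) + acceptance_rate(make_learner, b, a)) / 2
        pool.append((subset, make_learner().fit(rows), weight))
    # Select n models, favoring higher weights (the epsilon keeps every weight positive).
    return weighted_sample(pool, [w + 1e-9 for _, _, w in pool], n, rng)


def ensemble_accepts(ensemble, row):
    """Weighted vote: accept when the 'self' votes outweigh the 'not self' votes."""
    score = 0.0
    for subset, model, weight in ensemble:
        score += weight if model.accepts([row[j] for j in subset]) else -weight
    return score > 0


# Toy data: the claimed user's sessions cluster near 0, other users' near 5.
gen = random.Random(1)
own = [[gen.gauss(0, 1) for _ in range(6)] for _ in range(20)]
others = [[gen.gauss(5, 1) for _ in range(6)] for _ in range(20)]
ensemble = build_classifier(own, CentroidVerifier, pool_size=30, f=3, n=11)
far = 100.0 * sum(ensemble_accepts(ensemble, r) for r in others) / len(others)
```

In the study’s setting the pool size, subset size, and ensemble size would be 300, 5, and 107; the toy call uses smaller values only to keep the example fast.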


The random subspace method further increases the bias of the classifiers by restricting the number of features, which in turn reduces overfitting, a major source of classification errors. There is a clear linear relationship between FRR results and the ensemble size (i.e., the number of learners selected from the pool) (Fig. 8). FAR results depend on both the ensemble size and the pool size (Fig. 9). There is a significant performance difference (p < 0.05) between the FAR results of our random subspace ensemble learning method and those of the mixture-of-experts method for all users except User 8, which recorded the fewest sessions. There is a significant difference in FRR results between the two methods for half of the users, which suggests that some parameter adjustments for specific users might be required.

Figure 8: Sensitivity of FRR results to ensemble size, with a pool size of 300 and 5 features, for User 1

Figure 9: Sensitivity of FAR results to ensemble size and pool size in the random subspace ensemble with 5 features, for User 1

5 Conclusion

Authentication is important in scaling up the attribution of Web behavior to large numbers of users. Our experiments have shown that although the features of Web browsing behavior are not strong enough, individually or collectively, to authenticate and distinguish users, our random subspace method for ensemble learning can dramatically improve those results. Future work will include additional features as well as the exploration of additional one-class learners. Other research issues include extending our methodology to group profiles.